Hierarchical Dual-Head Model for Suicide Risk Assessment via MentalRoBERTa
Yang, Chang, Wang, Ziyi, Tan, Wangfeng, Tan, Zhiting, Ji, Changrui, Zhou, Zhiming
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China (ziyiwang2003@bupt.edu.cn)
Social media platforms have become important sources for identifying suicide risk, but automated detection systems face multiple challenges, including severe class imbalance, temporal complexity in posting patterns, and the dual nature of risk levels as both ordinal and categorical. This paper proposes a hierarchical dual-head neural network based on MentalRoBERTa for suicide risk classification into four levels: indicator, ideation, behavior, and attempt. The model employs two complementary prediction heads operating on a shared sequence representation: a CORAL (Consistent Rank Logits) head that preserves ordinal relationships between risk levels, and a standard classification head that enables flexible categorical distinctions. A 3-layer Transformer encoder with 8-head multi-head attention models temporal dependencies across post sequences, while explicit time interval embeddings capture posting behavior dynamics. The model is trained with a combined loss function (0.5 CORAL + 0.3 Cross-Entropy + 0.2 Focal Loss) that simultaneously addresses ordinal structure preservation, overconfidence reduction, and class imbalance. To improve computational efficiency, we freeze the first 6 layers (50%) of MentalRoBERTa and employ mixed-precision training. The model is evaluated using 5-fold stratified cross-validation with macro F1 score as the primary metric.
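The combined objective (0.5 CORAL + 0.3 cross-entropy + 0.2 focal) can be sketched numerically. This is a minimal NumPy illustration of the three loss terms and their weighting, not the paper's training code; the focal exponent gamma=2 and the logit shapes are assumptions.

```python
import numpy as np

def coral_loss(ordinal_logits, label, num_classes=4):
    """CORAL: K-1 shared binary logits; the k-th target is 1 iff label > k."""
    levels = (np.arange(num_classes - 1) < label).astype(float)
    p = 1.0 / (1.0 + np.exp(-ordinal_logits))  # sigmoid per threshold
    return -np.sum(levels * np.log(p) + (1 - levels) * np.log(1 - p))

def cross_entropy(logits, label):
    """Standard softmax cross-entropy for the categorical head."""
    z = logits - logits.max()  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def focal_loss(logits, label, gamma=2.0):
    """Focal loss: down-weights easy examples via the (1 - p_t)^gamma factor."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    p_t = probs[label]
    return -((1 - p_t) ** gamma) * np.log(p_t)

def combined_loss(ordinal_logits, class_logits, label):
    """Weighted sum from the abstract: 0.5 CORAL + 0.3 CE + 0.2 focal."""
    return (0.5 * coral_loss(ordinal_logits, label)
            + 0.3 * cross_entropy(class_logits, label)
            + 0.2 * focal_loss(class_logits, label))
```

With four risk levels the ordinal head emits three threshold logits, while the categorical head emits four class logits; the two heads share the same sequence representation upstream.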
DenseRec: Revisiting Dense Content Embeddings for Sequential Transformer-based Recommendation
Lichtenberg, Jan Malte, De Candia, Antonio, Ruffini, Matteo
Transformer-based sequential recommenders, such as SASRec or BERT4Rec, typically rely solely on learned item ID embeddings, making them vulnerable to the item cold-start problem, particularly in environments with dynamic item catalogs. While dense content embeddings from pre-trained models offer potential solutions, direct integration into transformer-based recommenders has consistently underperformed compared to ID-only approaches. We revisit this integration challenge and propose DenseRec, a simple yet effective method that introduces a dual-path embedding approach. DenseRec learns a linear projection from the dense embedding space into the ID embedding space during training, enabling seamless generalization to previously unseen items without requiring specialized embedding models or complex infrastructure. In experiments on three real-world datasets, we find DenseRec to consistently outperform an ID-only SASRec baseline, even without additional hyperparameter tuning and while using compact embedding models. Our analysis suggests improvements primarily arise from better sequence representations in the presence of unseen items, positioning DenseRec as a practical and robust solution for cold-start sequential recommendation.
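The dual-path idea can be sketched in a few lines: known items use their learned ID embedding, while unseen items are projected from content space into ID space. A least-squares fit stands in for the projection the abstract says is learned during training, and all dimensions and data here are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

d_content, d_id, n_seen = 16, 8, 100
content = rng.normal(size=(n_seen, d_content))  # dense content embeddings (pre-trained)
id_emb = rng.normal(size=(n_seen, d_id))        # learned ID embeddings (toy values)

# Linear projection content -> ID space. Least squares is a stand-in for
# the projection DenseRec learns end-to-end during training.
W, *_ = np.linalg.lstsq(content, id_emb, rcond=None)

def embed(item_idx=None, content_vec=None):
    """Dual path: ID embedding for known items, projected content otherwise."""
    if item_idx is not None:
        return id_emb[item_idx]
    return content_vec @ W

cold = rng.normal(size=d_content)  # content embedding of an unseen item
vec = embed(content_vec=cold)      # lands in the same space as the ID embeddings
```

Because the projected vector lives in the ID embedding space, the downstream transformer can score unseen items with no architectural change.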
La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching
Geffner, Tomas, Didi, Kieran, Cao, Zhonglin, Reidenbach, Danny, Zhang, Zuobai, Dallago, Christian, Kucukbenli, Emine, Kreis, Karsten, Vahdat, Arash
Recently, many generative models for de novo protein structure design have emerged. Yet, only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.
Robust signal decompositions on the circle
Imagine an agent moving along a circular path in the plane with some stationary landmarks, whose number and exact locations are unknown to the agent. Suppose that each landmark transmits an omnidirectional signal with a finite range, which we can model as a function that equals 1 inside a circular disk centered at the landmark and 0 outside. The boundaries of these disks, whose radii are in general different, may intersect the agent's path at one or two points or not at all. As the agent moves along its path, it can perceive these signals and so it knows, at each point, the number of landmarks that are within range. It cannot, however, identify different landmarks by their signals, and neither can it discern anything about each signal's strength other than its presence or absence. The agent's knowledge of its position on the circle may also not be precise, and the signal transmissions or measurements may occur with some sampling frequency rather than continuously in time. For these reasons, all that the agent can reliably reconstruct is a sequence of nonnegative integers corresponding to local landmark counts around the circle, and it may not be sure of the precise count at the exact points where this count changes. In this scenario, we want to pose the following questions: Can the agent figure out the total number of landmarks (excluding, of course, those whose signals do not reach any points on the circle)?
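The counting question above has a simple partial answer when no landmark's disk contains the entire path: each such landmark's signal meets the circle in a proper arc, so it contributes exactly one rising edge to the cyclic count sequence. A minimal sketch, assuming noise-free counts and no whole-circle disks (those add a constant offset that rising edges cannot detect):

```python
def landmark_count(counts):
    """Estimate the number of landmarks whose signal reaches the path.

    `counts` is the cyclic sequence of local landmark counts sampled
    around the circle. Summing the positive increments (the last sample
    wraps back to the first) counts one rising edge per signal arc.
    Assumes no disk covers the whole path and that counts are exact.
    """
    n = len(counts)
    return sum(max(counts[(i + 1) % n] - counts[i], 0) for i in range(n))
```

For example, two overlapping arcs produce a count profile like 0, 1, 2, 1, 0, 0, whose two rising edges recover both landmarks; a single landmark covering the whole circle yields a constant profile and is invisible to this estimator.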
Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?
Li, Wing Yan, Wang, Zeqiang, Johnson, Jon, De, Suparna
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
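The retrieval task can be illustrated with the simplest unsupervised baseline: bag-of-words cosine similarity between a query question and candidate survey questions. This toy sketch is far weaker than the probabilistic and neural IR models the paper evaluates; the example questions are invented.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str]) -> str:
    """Return the corpus question most similar to the query."""
    q = Counter(query.lower().split())
    return max(corpus, key=lambda d: cosine(q, Counter(d.lower().split())))

corpus = [
    "How many rooms does your accommodation have?",
    "What is your current occupation?",
    "Do you rent or own your home?",
]
best = retrieve("How many rooms are in your home?", corpus)
```

Lexical matching like this is exactly where the paper's qualitative analysis flags trouble: high word overlap with a mismatched sub-concept still scores well, which motivates the IR-specialised neural models.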
Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
Yan, Keqiang, Li, Xiner, Ling, Hongyi, Ashen, Kenna, Edwards, Carl, Arróyave, Raymundo, Zitnik, Marinka, Ji, Heng, Qian, Xiaofeng, Qian, Xiaoning, Ji, Shuiwang
We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
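The uniqueness requirement can be illustrated with a toy canonicalization: sorting atoms into a fixed order makes the emitted sequence invariant to how the atom list was permuted. This captures only permutation invariance, not the SE(3) and periodic invariance Mat2Seq achieves (which also requires canonicalizing the cell and origin); the element symbols and coordinates are made up.

```python
def canonical_sequence(species, frac_coords):
    """Toy canonical ordering: sort atoms by (element, fractional coords)
    so any permutation of the same atom list yields one unique string.
    A stand-in for Mat2Seq's full canonicalization, which additionally
    fixes the lattice representation to gain SE(3)/periodic invariance."""
    atoms = sorted(
        zip(species, (tuple(round(x, 6) for x in c) for c in frac_coords))
    )
    return " ".join(f"{s} {a:.3f} {b:.3f} {c:.3f}" for s, (a, b, c) in atoms)

# Two descriptions of the same toy crystal, with atoms listed in different orders.
s1 = canonical_sequence(
    ["Si", "O", "O"],
    [(0.0, 0.0, 0.0), (0.25, 0.5, 0.75), (0.75, 0.5, 0.25)],
)
s2 = canonical_sequence(
    ["O", "Si", "O"],
    [(0.75, 0.5, 0.25), (0.0, 0.0, 0.0), (0.25, 0.5, 0.75)],
)
```

Without such a canonical order, the same crystal could tokenize to many different sequences, which is the failure mode the abstract attributes to raw CIF streams.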
EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations
Liu, Nuowei, Sun, Changzhi, Ji, Tao, Tian, Junfeng, Tang, Jianxin, Wu, Yuanbin, Lan, Man
Current Large Language Models (LLMs) for understanding proteins primarily treat amino acid sequences as a text modality. Meanwhile, Protein Language Models (PLMs), such as ESM-2, have learned massive sequential evolutionary knowledge from the universe of natural protein sequences. Furthermore, structure-based encoders like ProteinMPNN learn the structural information of proteins through Graph Neural Networks. However, whether the incorporation of protein encoders can enhance the protein understanding of LLMs has not been explored. To bridge this gap, we propose EvoLlama, a multimodal framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding. EvoLlama consists of a ProteinMPNN structure encoder, an ESM-2 protein sequence encoder, a multimodal projector to align protein and text representations and a Llama-3 text decoder. To train EvoLlama, we fine-tune it on protein-oriented instructions and protein property prediction datasets verbalized via natural language instruction templates. Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced, outperforming other fine-tuned protein-oriented LLMs in zero-shot settings by an average of 1%-8% and surpassing the state-of-the-art baseline with supervised fine-tuning by an average of 6%. On protein property prediction datasets, our approach achieves promising results that are competitive with state-of-the-art task-specific baselines. We will release our code in a future version.
Contrastive Representation Learning for Predicting Solar Flares from Extremely Imbalanced Multivariate Time Series Data
Vural, Onur, Hamdi, Shah Muhammad, Boubrahimi, Soukaina Filali
Major solar flares are abrupt surges in the Sun's magnetic flux, presenting significant risks to technological infrastructure. In view of this, effectively predicting major flares from solar active region magnetic field data through machine learning methods becomes highly important in space weather research. Magnetic field data can be represented in multivariate time series modality where the data displays an extreme class imbalance due to the rarity of major flare events. In time series classification-based flare prediction, the use of contrastive representation learning methods has been relatively limited. In this paper, we introduce CONTREX, a novel contrastive representation learning approach for multivariate time series data, addressing challenges of temporal dependencies and extreme class imbalance. Our method involves extracting dynamic features from the multivariate time series instances, deriving two extremes from positive and negative class feature vectors that provide maximum separation capability, and training a sequence representation embedding module with the original multivariate time series data guided by our novel contrastive reconstruction loss to generate embeddings aligned with the extreme points. These embeddings capture essential time series characteristics and enhance discriminative power. Our approach shows promising solar flare prediction results on the Space Weather Analytics for Solar Flares (SWAN-SF) multivariate time series benchmark dataset against baseline methods.
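The core idea of aligning embeddings with class extremes can be sketched as a pull/push objective. Here the per-class "extreme" is simplified to the class mean, which only stands in for the maximally separated extreme points CONTREX derives from dynamic features; all data is synthetic and the embedding module itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dynamic-feature vectors: rare flare (+) vs. abundant quiet (-) instances,
# mimicking the extreme class imbalance of SWAN-SF.
pos = rng.normal(loc=+1.0, size=(20, 5))
neg = rng.normal(loc=-1.0, size=(200, 5))

# Stand-in for the paper's extreme points: one anchor per class. CONTREX
# derives maximally separated extremes; the class mean is a simplification.
pos_extreme = pos.mean(axis=0)
neg_extreme = neg.mean(axis=0)

def contrastive_target_loss(embedding, is_positive):
    """Pull an embedding toward its own class extreme, push it from the other."""
    own = pos_extreme if is_positive else neg_extreme
    other = neg_extreme if is_positive else pos_extreme
    pull = np.sum((embedding - own) ** 2)
    push = np.sum((embedding - other) ** 2)
    return pull - push
```

Training the sequence embedding module to minimize this quantity drives flare and quiet embeddings toward opposite anchors, which is what gives the learned representations their discriminative power despite the imbalance.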